Progress in Computer Vision at the University of Massachusetts 1

Authors

  • Allen R. Hanson
  • Edward M. Riseman
  • Howard Schultz
Abstract

This report summarizes progress in image understanding research at the University of Massachusetts over the past year. Many of the individual efforts discussed in this paper are developed further in other papers in these proceedings. The summary is organized into several areas:

1. 3D Site Modeling from Aerial Views
2. Terrest Terrain Reconstruction System
3. Terrain Classification / Force Monitoring
4. Content-based Image Indexing
5. Learning in Vision
6. Miscellaneous Related Research

One goal of the UMass research program is the integration of a diverse set of research efforts into systems that are ultimately intended to achieve robust, real-time image interpretation in a variety of vision applications.

1 This work was supported in part by DARPA under AO # E658 and DARPA contract numbers DAAL02-91-K-0047 (via ARL), DACA76-92-C-0041 (via TEC), F30602-94-C-0042 (via Rome Laboratory), by NSF grant number CDA-8922572, and by Lockheed Martin under Subcontract/PO number RRM072030.

1. 3D Site Modeling from Aerial Views

1.1 The Ascender Site Modeling System

Under the DARPA/ORD RADIUS program, UMass developed the ASCENDER system (Automated Site Construction, Extension, Detection and Refinement) for automatically populating a site model with 3D building models extracted from multiple, overlapping images (both nadir and oblique) of the site (Collins et al. 1996). The UMass design philosophy emphasizes model-directed processing, rigorous 3D photogrammetric camera models, and fusion of information across multiple images for increased accuracy and reliability. The Ascender system has been transferred to Lockheed-Martin and the National Exploitation Laboratory (NEL), where it has been evaluated on classified data sets.

The Ascender system acquires, extends, and refines 3D geometric site models from aerial imagery with known camera parameters. To acquire a new site model, an automated building detector is run on one image to hypothesize potential building rooftops.
Supporting evidence is located in other images via epipolar line segment matching in constrained search regions. The precise 3D shape and location of each building is then determined by multi-image triangulation and shape optimization under constraints of 3D orthogonality, parallelism, colinearity, and coplanarity of lines and surfaces. As new images of the site become available, they are matched to the partial site model, and model extension and refinement procedures are performed to add previously unseen buildings and to improve the geometric accuracy of the existing building models. In this way, the system gradually accumulates evidence over time to make the site model more complete and more accurate.

Based on initial experience in the evaluation at the NEL, major changes have been made to Ascender's control system. The original system used a single reference image to generate roof hypotheses in the form of polygons, and then used the remaining images to verify or reject buildings by constructing a 3D model. If a building hypothesis was not found in the reference image, the building would not be constructed even though it might be clearly visible in one or more of the other images. A new control strategy has been implemented under which all images are processed uniformly; polygons found in any image are used as the set of initial rooftop hypotheses from which the 3D reconstruction begins.

Tests have been performed on a subregion of the Fort Hood dataset. Polygons were detected in seven images, and redundant polygons were eliminated on the basis of overlap. Each of the remaining polygons was then used to construct a 3D building model. Models that had a side or height of less than 5 meters were eliminated. Using this scheme, 92% of the 76 rooftop polygons were detected, leaving six polygons missed in all seven images.
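The overlap-based elimination of redundant rooftop hypotheses can be sketched as below. This is a simplified illustration assuming axis-aligned rectangles and an invented 0.5 intersection-over-union threshold; Ascender's actual polygons and overlap test are more general.

```python
# Sketch of redundant-hypothesis elimination, assuming axis-aligned
# rectangles (x0, y0, x1, y1); Ascender's polygons are more general.

def iou(a, b):
    """Intersection-over-union of two axis-aligned rectangles."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def dedupe(polys, thresh=0.5):
    """Keep the first of any pair overlapping by more than `thresh`."""
    kept = []
    for p in polys:
        if all(iou(p, q) <= thresh for q in kept):
            kept.append(p)
    return kept

hyps = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
unique = dedupe(hyps)   # the second box overlaps the first and is dropped
```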
An additional 45 polygons represented false positives, arising either from errors in the 2D grouping process that survived verification or from the reconstruction of a cultural feature other than a building (parking areas, playing fields, etc.) that had errors in height due to limited support from the image set.

1.2 Ascender II: Context-Sensitive Control of Reconstruction

Work on the Ascender I system demonstrated that the use of multiple strategies and 3D information fusion can significantly extend the range of complex building types that can be reconstructed (Jaynes et al. 1996). Under the DARPA APGD program, we are designing and building the successor to the Ascender system. The design approach is based on the observation that while many IU techniques function reasonably under constrained conditions, no single IU method works well under all conditions. Consequently, work on Ascender II is focusing on the use of multiple alternative reconstruction strategies, from which the most appropriate strategies are selected by the system based on the current context. In particular, the new system will utilize a wider set of algorithms that fuse 2D and 3D information and will make use of EO, SAR, IFSAR, and multispectral imagery during the reconstruction process. We believe that such a system will be capable of more robust reconstruction of three-dimensional site models than has been demonstrated in the past and will significantly reduce the effort required by image analysts during the reconstruction process. This in turn will result in faster development of topical situational and visualization products of military significance.

Ascender II is organized into two subsystems. The IU components of the system responsible for manipulating image data are being constructed in the ARPA RCDE system, while the control and inferencing components are represented as Bayesian belief networks constructed using the Hugin system (Jensen 1996).
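As a hedged illustration of how such a belief network can drive control, the toy sketch below selects the next object node to expand as the one whose current marginal belief is most uncertain (highest Shannon entropy). The node names and probabilities are invented, not taken from the actual Ascender II network.

```python
import math

# Toy "maximum uncertainty" selection: given current marginal beliefs
# for each unresolved network node, expand the node whose belief
# distribution has the highest Shannon entropy. Node names and
# probabilities are illustrative only.

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def select_node(beliefs):
    """beliefs: dict mapping node name -> probability distribution."""
    return max(beliefs, key=lambda n: entropy(beliefs[n]))

beliefs = {
    "building":    [0.9, 0.1],    # nearly resolved
    "parking-lot": [0.5, 0.5],    # maximally uncertain
    "road":        [0.7, 0.3],
}
next_node = select_node(beliefs)  # the most uncertain node is expanded next
```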
Communication between the two subsystems is currently supported by UNIX socket facilities using packets structured for this application. Control policies (strategies) are associated with the object classes represented in the network. Execution of a control policy results in the accumulation of evidence for or against the corresponding network node. This evidence is propagated through the network based on the Bayesian probability tables constructed as part of the knowledge base. Currently, a maximum-uncertainty policy is used to select the next node in the network to be expanded, although more sophisticated mechanisms are being explored as part of the research. A second issue being examined relates to the appropriate granularity of the control policies and therefore the granularity of the reconstruction systems themselves. Finally, a major effort is underway to develop new and expanded reconstruction strategies and the IU procedures required to support them.

1.3 Reconstruction Strategies

Our focus under the DARPA APGD program will be on more general 3D reconstruction strategies that utilize multiple types of features (points, lines, surfaces) and that can be applied to a wide range of parameterized building classes. The system-level approach involves multiple alternative detection and reconstruction strategies, invoked by clear contextual cues, that combine a wider set of algorithms and features for generating and fusing 2D and 3D information. The strategies being developed will utilize both monocular optical data and digital elevation data obtained from IFSAR or multi-view stereo reconstruction from EO data.

Several new reconstruction strategies are being developed for this system. To take a single example, Ascender I rooftop hypotheses from a single optical image are projected into registered elevation data and used to trigger the application of a rooftop model matching strategy on the restricted subset of the data.
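One way such a rooftop model matching strategy can operate on registered elevation data is sketched below: a DEM patch is indexed by its histogram of surface slope angles and correlated against model histograms. This is a reduced, assumption-laden stand-in for the Gaussian-sphere orientation matching used in the actual system; the "flat" and "peaked" models and the synthetic grid are invented for illustration.

```python
import numpy as np

# Minimal sketch: index a rooftop elevation patch by its distribution of
# surface slopes (assuming a gridded DEM with unit ground spacing) and
# pick the model whose histogram correlates best. The real matcher uses
# ~12 parameterized models and full orientation histograms on the
# Gaussian sphere.

def slope_histogram(dem, bins=18):
    gy, gx = np.gradient(dem.astype(float))
    slope = np.degrees(np.arctan(np.hypot(gx, gy)))  # 0 deg = horizontal
    hist, _ = np.histogram(slope, bins=bins, range=(0, 90), density=True)
    return hist

def classify(dem, model_hists):
    h = slope_histogram(dem)
    # Normalized correlation against each model's histogram.
    score = lambda m: float(np.dot(h, m) /
                            (np.linalg.norm(h) * np.linalg.norm(m) + 1e-12))
    return max(model_hists, key=lambda name: score(model_hists[name]))

x = np.arange(32)
flat = np.zeros((32, 32))
peaked = np.minimum(x, x[::-1])[None, :].repeat(32, axis=0).astype(float)

models = {"flat": slope_histogram(flat), "peaked": slope_histogram(peaked)}
noise = 0.01 * np.random.default_rng(0).normal(size=(32, 32))
label = classify(peaked + noise, models)   # noisy peaked roof
```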
The model matcher uses a knowledge base of approximately 12 parameterized rooftop models (including flat, peaked, and curved roof models) and matches by correlating a histogram of surface orientations derived from the data with the orientation pattern of the model surfaces on the Gaussian sphere. Initial experiments have been performed with the Ascona/ISPRS "Flat Scene" [ftp://ftp.ifp.uni-stuttgart.de/pub/wg3/]. This scene contains several peaked roofs with different slopes, cluttered with gabled windows, chimneys, etc. In addition, the elevation data contains noise unavoidably introduced through the stereo reconstruction process. In this experiment, the top two models resulting from the correlation process were selected and fit to the constrained elevation data. The model with the lowest residual fit error was chosen for the final reconstruction. Excluding the six rooftops missed by the hypothesis generation phase, the remaining eleven rooftops were correctly classified as peaked-roof buildings. After surface fitting the models using the initial parameters found during indexing, the average residual fit error was 0.192 meters. See Jaynes et al. (1997) in these proceedings for more detail.

1.4 Reconstruction Strategies from IFSAR

As part of an ORD-sponsored feasibility study, several approaches to bottom-up reconstruction of spatially coherent structures from IFSAR data have been explored. The strategies are being extended under the APGD program for inclusion in Ascender II. In one approach, building footprints extracted from optical data are used as a focus-of-attention mechanism to select subsets of the IFSAR data. This is followed by the application of robust 3D surface fitting techniques to the elevation data. In a second approach, variations on traditional region growing methods are applied to the IFSAR data alone. Geometric constraints can be imposed during region growing to produce rectangular or rectilinear shapes.
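A minimal sketch of geometrically constrained region growing on IFSAR-like elevation data follows: a region of similar height is grown from a seed, and the rectangular-shape constraint is then applied as a final test on how well the region fills its bounding box. The actual system can impose constraints during growth; the height tolerance and fill threshold here are invented.

```python
import numpy as np

# Grow a region of similar height from a seed cell, then impose a
# rectangularity constraint (fraction of the bounding box filled).
# Thresholds are illustrative only.

def grow(dem, seed, height_tol=0.5):
    region, stack, h0 = {seed}, [seed], dem[seed]
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < dem.shape[0] and 0 <= nc < dem.shape[1]
                    and (nr, nc) not in region
                    and abs(dem[nr, nc] - h0) <= height_tol):
                region.add((nr, nc))
                stack.append((nr, nc))
    return region

def rectangularize(region, min_fill=0.8):
    rs = [p[0] for p in region]
    cs = [p[1] for p in region]
    box = (min(rs), min(cs), max(rs), max(cs))
    fill = len(region) / ((box[2] - box[0] + 1) * (box[3] - box[1] + 1))
    return box if fill >= min_fill else None   # reject non-box-like blobs

dem = np.zeros((8, 8))
dem[2:6, 2:7] = 10.0                 # a 4x5 flat-roofed "building"
roof = grow(dem, (3, 3))
footprint = rectangularize(roof)     # bounding box (row0, col0, row1, col1)
```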
These ideas have been explored using the initial Sandia/Kirtland AFB dataset; this dataset has since been characterized as particularly noisy, with a large number of drop-outs and outliers. This effort is described in more detail in (Hoepfner et al. 1997; Jaynes et al. 1997).

1.5 Surface Microstructure Extraction from Multiple Aerial Images

Building surface microstructures provide important information for many military and civilian applications. The extraction of small-scale information from aerial imagery is difficult due to problems caused by perspective distortion, data deficiencies, and shadows and occlusions. A subsystem has been developed for improved extraction of site model details, often at a scale close to the limits of image resolution. An Orthographic Facet Image Library (OFIL) system and a generic window and door extraction module have been constructed under the assumption that an initial site model is available and sufficient camera and light source information is known. The OFIL system is designed to systematically collect the building facet intensities from multiple aerial images into an organized orthographic library, eliminate the effects of shadows and occlusions, and combine the intensities from different sources to form a complete and consistent intensity representation for each facet. A 'Best Piece Representation' algorithm is designed to combine intensities from multiple views, resulting in a unique surface intensity representation. The window extraction module focuses attention on wall facets, attempting to extract the 2D window and door patterns attached to the walls. The algorithms are typically useful in urban sites. Experiments show successful applications of this approach to site model refinement and improved fly-through scene visualizations; details can be found in (Wang et al. 1997).
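The multi-view fusion step can be sketched in the spirit of the 'Best Piece Representation': each view contributes an orthographically resampled facet image plus a validity mask (False where a pixel was shadowed or occluded), and every output pixel is taken from the best valid view. The per-view "score" used here is an invented stand-in for the actual selection criterion.

```python
import numpy as np

# Fuse per-facet intensity images from several views: each output pixel
# comes from the highest-scoring view in which it is valid (not shadowed
# or occluded). The scalar score per view is a placeholder criterion.

def fuse_facet(views):
    """views: list of (intensity HxW, valid HxW bool, score) per image."""
    h, w = views[0][0].shape
    out = np.full((h, w), np.nan)          # NaN where no view covers a pixel
    best = np.full((h, w), -np.inf)
    for intensity, valid, score in views:
        take = valid & (score > best)
        out[take] = intensity[take]
        best[take] = score
    return out

# View a is sharper (score 2.0) but has one shadowed pixel; view b
# covers everything at lower quality.
a = (np.full((2, 2), 100.0), np.array([[True, True], [False, True]]), 2.0)
b = (np.full((2, 2), 60.0), np.ones((2, 2), bool), 1.0)
facet = fuse_facet([a, b])
```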
2. Terrest Terrain Reconstruction System

The UMass terrain reconstruction system, Terrest (Schultz 1994), deals effectively with highly oblique viewing conditions using a texture correlation algorithm that incorporates (1) hierarchical unwarping, (2) weighted cross-correlation, and (3) narrow-search subpixel registration. Recent extensions of Terrest (Schultz, Stolle et al. 1997) to site modeling applications involve the incorporation of boundary constraints into the correlation masks. The windows of a correlation mask are restricted to lie completely on one side of a boundary, thereby causing the mask to adapt to the context in which it is applied. For example, with building roofs, correlation masks are automatically shaped to lie entirely inside or outside the area where a polygonal boundary has been detected (Quam 1984). The net result is significantly sharpened digital elevation maps (DEMs).

2.2 Automated Bundle Adjustment

An effort is underway to automatically select image match points as a precursor to the bundle adjustment process necessary to precisely register images and to compute precise relative camera orientation. The approach we are developing uses building corners as the feature points of choice. These features are weighted according to their distinguishability, and correlation masks are generated from features of high confidence. We are examining whether the correlation peaks resulting from the use of these features during matching can be analyzed to distinguish between true correspondences and false matches. Since the degree of precision obtained in the camera pose parameters depends strongly on the accuracy of the match point locations, a robust estimation technique is used to remove outliers (i.e., false correspondences) that passed the local tests. The final step is a least-squares technique to obtain the final relative orientation. Typically 20 to 30 distinctive points are required for accurate results.
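The outlier-removal-then-least-squares sequence can be sketched as follows: fit a simple mapping between matched point sets by least squares, discard correspondences whose residual is far above the median residual, and refit. The real system estimates relative camera orientation; an affine fit stands in here, and the median-based cutoff is an invented threshold.

```python
import numpy as np

# Robust match-point filtering before a final least-squares fit. The
# affine model and 4x-median rejection rule are illustrative stand-ins
# for the actual relative-orientation estimation.

def fit_affine(src, dst):
    A = np.hstack([src, np.ones((len(src), 1))])   # [x y 1] design matrix
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return M                                       # 3x2 affine parameters

def robust_fit(src, dst, rounds=3, k=4.0):
    keep = np.ones(len(src), bool)
    A = np.hstack([src, np.ones((len(src), 1))])
    for _ in range(rounds):
        M = fit_affine(src[keep], dst[keep])
        res = np.linalg.norm(A @ M - dst, axis=1)
        keep = res <= k * max(np.median(res[keep]), 1e-6)
    return M, keep

rng = np.random.default_rng(2)
src = rng.uniform(0, 100, size=(30, 2))
dst = src @ np.array([[1.0, 0.02], [-0.02, 1.0]]) + np.array([5.0, -3.0])
dst += rng.normal(0, 0.05, dst.shape)     # small localization noise
dst[0] += 40.0                            # one gross false match
M, inliers = robust_fit(src, dst)
```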
3. Algorithms to Support Force Monitoring

3.1 Using Three-Dimensional Features to Improve Terrain Classification

Image texture has long been regarded as the spatial distribution of gray-level variation, and texture analysis has generally been confined to the 2D image domain. We have demonstrated the utility of "3D world texture" as a function of 3D structures (Wang et al. 1997) and proposed a set of 3D textural features. The proposed 3D features have great potential in terrain classification. Experiments have been carried out to compare the 3D features with a traditional 2D feature set. The results show that the 3D features significantly outperform the 2D features in terms of classification accuracy and training data reliability. The classifications have been used to generate ground cover maps and a skeletal road network.

3.2 Visibility Analysis for Force Monitoring

Visibility analysis algorithms have been developed for a variety of force monitoring scenarios, including stealth path planning, placing a set of observers on an elevation map to maximize spatial coverage, and analyzing when and where a force of a given size would be detected over a given line of advance. Our theoretical work is based on the Art Gallery Problem: determining the number of observers necessary to cover an art gallery such that every point is seen by at least one observer. A polynomial-time solution has been developed for the 3D version of the Art Gallery Problem; because the problem is NP-hard, the solution presented is an approximation, and bounds on the solution are given. Our solution uses techniques from computational geometry, graph coloring, and set coverage. A complexity analysis for each step and an analysis of the overall quality of the solution have been derived (Marengoni et al. 1996). This general algorithm has been applied to several problems in visibility analysis on an elevation map.
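The approximation flavor of such observer placement can be sketched with the classic greedy set-cover heuristic: repeatedly add the candidate observer that sees the most not-yet-covered terrain cells. Visibility sets are given directly here rather than computed from a DEM, and the candidate names are invented.

```python
# Greedy set-cover sketch of observer placement on terrain: visibility
# sets per candidate observer are assumed precomputed.

def place_observers(candidates, all_cells):
    """candidates: dict observer -> set of visible cells."""
    uncovered, chosen = set(all_cells), []
    while uncovered:
        best = max(candidates, key=lambda o: len(candidates[o] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break                  # remaining cells are invisible to everyone
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered

cands = {
    "ridge":   {1, 2, 3, 4, 5},
    "hilltop": {4, 5, 6, 7},
    "valley":  {7, 8},
}
observers, unseen = place_observers(cands, range(1, 9))
```

Greedy set cover is the standard polynomial-time approximation for this NP-hard problem, with a logarithmic approximation bound.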
4. Content-based Image Indexing

4.1 Center for Intelligent Information Retrieval (CIIR)

The Center for Intelligent Information Retrieval (CIIR) conducts leading basic research in the area of information systems. This national center is one of only four centers in science and engineering to be funded in 1992 by the National Science Foundation under its State/Industry University Cooperative Research Centers program. One of the goals of the CIIR is to develop tools that provide effective and efficient access to large, heterogeneous, distributed text and multimedia databases. A new partnership between the Computer Vision Lab and the CIIR is focused on content-based multimedia indexing and retrieval, a difficult yet vitally important task. The aim of content-based retrieval is to efficiently find, in a large database, images that contain the object represented in a query image.

4.2 Appearance-Based Indexing and Retrieval

A system to retrieve images using a syntactic description of appearance has been developed and appears in these proceedings (Ravela and Manmatha 1997). A multi-scale invariant vector representation of the images in the database is obtained by first filtering with Gaussian derivative filters at several scales and then computing low-order differential invariants; this is done off-line. At run time, queries are designed by users from an example image by selecting a set of salient regions. The responses corresponding to these regions are matched with those of the database, and a measure of fitness for each image in the database is computed in both feature space and coordinate space. The results are then displayed to the user, sorted by match score. Experiments conducted with over 1500 images show that images similar in appearance, and whose viewpoints are within 25 degrees of the query image, can be effectively retrieved.
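The off-line representation step can be sketched as below: filter with Gaussian derivatives at several scales and keep rotation-invariant combinations (smoothed intensity, gradient magnitude, Laplacian). The published system uses a richer set of low-order differential invariants; this reduced form only illustrates the construction.

```python
import numpy as np

# Multi-scale Gaussian-derivative invariants (reduced set): per pixel
# and scale, keep L, |grad L|, and the Laplacian, all rotation-invariant.

def gauss_kernel(sigma, order):
    r = int(3 * sigma)
    x = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    if order == 0:
        return g
    if order == 1:
        return -x / sigma**2 * g                 # first derivative of Gaussian
    return (x**2 - sigma**2) / sigma**4 * g      # second derivative (order 2)

def sep_filter(img, kx, ky):
    # Separable filtering: rows then columns, 'same'-size output.
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kx, "same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, ky, "same"), 0, tmp)

def invariants(img, sigmas=(1.0, 2.0, 4.0)):
    feats = []
    for s in sigmas:
        g0, g1, g2 = (gauss_kernel(s, o) for o in (0, 1, 2))
        L = sep_filter(img, g0, g0)
        Lx = sep_filter(img, g1, g0)
        Ly = sep_filter(img, g0, g1)
        Lxx = sep_filter(img, g2, g0)
        Lyy = sep_filter(img, g0, g2)
        feats += [L, np.hypot(Lx, Ly), Lxx + Lyy]
    return np.stack(feats)      # 9 feature maps: 3 invariants x 3 scales

img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0           # a synthetic bright square
F = invariants(img)
```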
4.3 Color-Based Indexing and Retrieval

A new multi-phase, color-based image retrieval system (FOCUS) has been developed that is capable of identifying multi-colored query objects in an image in the presence of significant interfering backgrounds. The query object may occur at arbitrary sizes, orientations, and locations in the database images. The color features used to describe an image were developed for speed in matching and ease of computation on complex images while maintaining scale and rotation invariance. The first phase matches the color content of an image against the query object colors using an efficient indexing mechanism. The second phase matches the spatial relationships between color regions in the image against those of the query using a spatial proximity graph (SPG) structure designed for the purpose. The method is fast and has low storage overhead. Test results with multi-colored query objects from man-made and natural domains show that FOCUS is quite effective in handling interfering backgrounds and large variations in scale (Das and Riseman 1997).

4.4 Text Detection and Extraction in Images

There are many applications in which the automatic detection and recognition of text embedded in images is useful, including multimedia systems, digital libraries, and Geographical Information Systems. However, text is often printed against shaded or textured backgrounds or is embedded in images; examples include maps, advertisements, photographs, videos, and stock certificates. Current OCR and other document recognition technology cannot handle these situations well. A four-step system has been developed that automatically detects and extracts text from images by treating it as a texture (Wu et al. 1997). First, potential text locations are found by filtering with second-order derivatives of Gaussians at three different scales. Second, vertical strokes are extracted from horizontally aligned text regions.
Based on several heuristics, such as height similarity, spacing, and alignment, strokes are grouped into tight rectangular bounding boxes around text strings. These steps are applied to a pyramid of images generated from the input image in order to detect text over a wide range of font sizes, and the boxes are then fused at the original resolution. In the third step, the background is cleaned up and the image is converted to binary. Finally, the text bounding boxes are refined (repeating steps 2 and 3) using the extracted items as strokes. The final output is two binary images for each text box, which can then be passed to any standard OCR software. The system has been tested on images from a wide variety of sources, including newspapers, magazines, photographs, digitized video frames, etc. Of the 21820 characters and 4406 words in these test images, 95% of the characters and 93% of the words were successfully extracted by the system. Of these, 14703 characters and 2981 words are believed to be in OCR-readable fonts, and 84% of the characters and 77% of the words were successfully recognized by a commercial OCR system.

5. Learning in Vision

5.1 New Formulation of Control Policies Based on Markov Decision Processes and Reinforcement Learning

The original paradigm for SLS was to use decision trees (or other classifiers) to evaluate intermediate data results at each level of representation, and to use some mechanism to choose how to transform data from one level of representation to the next. We have developed a model for applying reinforcement learning to object recognition in which each level of representation is viewed as a continuous feature space defined by its measurable attributes, and control policies are learned using a combination of reinforcement learning and neural networks that map points in the feature spaces onto optimal actions. An initial implementation of this new approach was completed in December 1995.
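A toy version of this formulation can be sketched as follows, with invented features, actions, and rewards, and a linear function approximator standing in for the neural networks: states are points in a continuous feature space, actions are alternative visual procedures, and action values are learned from one-step reward episodes.

```python
import numpy as np

# Toy sketch of learning a control policy over a continuous feature
# space: linear Q-values per action, updated by one-step Q-learning.
# The task (which procedure succeeds on which features) is invented.

rng = np.random.default_rng(0)
n_features, n_actions = 2, 2
W = np.zeros((n_actions, n_features))     # linear Q(s, a) = W[a] @ s

def q_update(s, a, reward, lr=0.1):
    W[a] += lr * (reward - W[a] @ s) * s  # one-step episode: target = reward

for _ in range(500):
    s = rng.uniform(0, 1, size=n_features)
    a = rng.integers(n_actions)           # explore uniformly while training
    # Invented task: procedure 0 succeeds when feature 0 dominates,
    # procedure 1 when feature 1 dominates.
    reward = 1.0 if (a == 0) == (s[0] > s[1]) else 0.0
    q_update(s, a, reward)

policy = lambda s: int(np.argmax(W @ np.asarray(s)))
```

After training, the greedy policy picks procedure 0 for feature vectors dominated by the first attribute and procedure 1 otherwise.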
In the first major test of this system, it was 10-for-10 in recognizing rooftops in aerial images of Ft. Hood, TX. These results (which use a reduced library of visual procedures) were reported in (Draper 1996). Prof. Bruce Draper has joined Colorado State University and will continue developing this work on reinforcement learning in applications related to the Automatic Population of Geospatial Databases (APGD) program.

5.2 Real-Time Interactive Classification

Manual generation of training examples for supervised learning is an expensive process. One way to reduce this cost is to interactively produce training instances that are highly informative. The feasibility of such an approach has been demonstrated on an image pixel classification task that is the front end to many higher-level reasoning applications that make useful inferences about the contents of an image. The construction of pixel classifiers is a labor-intensive task involving user interaction to manually select feature sets, manually select local training data for each desired object class, and then provide feedback on the resulting classification for additional refinement until satisfactory global classification is achieved. The prototype system we have implemented is thus an exploration of a new classification paradigm. We have developed a prototype interactive tool (Piater and Utgoff 1997) that allows the user to immediately see the result of selecting incremental training data, so that further selection can be adjusted on the basis of inaccurate classification. This system shows that the incremental classifier converges to satisfactory performance after a very small number of training instances and requires only a fraction of the typical human effort to provide them. This suggests an interactive real-time 3D visualization tool for incremental classification of terrain in aerial images.
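The interactive loop can be sketched as below: the user labels a few pixels at a time, an incremental classifier updates immediately, and the relabeled image guides the next selections. A running nearest-class-mean over per-pixel feature vectors stands in for the incremental decision-tree learner of the cited tool; the features and labels are invented.

```python
import numpy as np

# Incremental nearest-class-mean classifier: each user click adds one
# labeled feature vector; class means update in O(1), so the displayed
# classification can be refreshed immediately.

class IncrementalNCM:
    def __init__(self):
        self.sums, self.counts = {}, {}

    def add(self, feature, label):
        f = np.asarray(feature, float)
        self.sums[label] = self.sums.get(label, 0) + f
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, feature):
        f = np.asarray(feature, float)
        means = {c: self.sums[c] / self.counts[c] for c in self.sums}
        return min(means, key=lambda c: np.linalg.norm(f - means[c]))

clf = IncrementalNCM()
clf.add([0.1, 0.8, 0.2], "vegetation")   # user clicks a green pixel...
clf.add([0.6, 0.6, 0.6], "pavement")     # ...then a gray one
label = clf.predict([0.15, 0.7, 0.25])   # nearby pixels classified at once
```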
This allows interactive training of the classifier, with the user examining the world data from more natural and understandable viewpoints that show the sensor data in the context of its three-dimensional characteristics, e.g., from a 45-degree downward oblique view, where the sides of objects and terrain are more understandable. Classification results can then be rapidly overlaid onto the terrain model using a variety of graphic display techniques, with incremental real-time updating of training data.

6. Miscellaneous Related Research

6.1 Persistent Data Management for Visual Applications

Visual applications need to represent, manipulate, store, and retrieve both raw and processed visual data. Existing relational and object-oriented database systems fail to offer satisfactory visual data management support because they lack the kinds of representations, storage structures, indices, access methods, and query mechanisms needed for visual data. We have previously argued that extensible visual object stores offer a feasible and effective means to address the data management needs of visual applications (Draper 1993; Draper). Such a visual object store, ISR4, is under development at the University of Massachusetts for the management of persistent visual information. ISR4 is designed to offer extensive storage and retrieval support for large, complex sets of visual data, customizable buffering and clustering, and spatial and temporal indexing, along with a variety of multi-dimensional access methods and query languages.

6.2 Segmentation of Stroke Lesions in MRI

A collaborative exploratory project with Baystate Medical Center is underway for the analysis of stroke lesions in brain scans (Piater 1996). The goal of this clinical study is the volumetric analysis of damaged cells in patients who have suffered an acute ischemic stroke, and of their response over time to various forms of treatment involving the lowering of blood pressure in the period immediately following the stroke.
This requires the segmentation of brain lesions, where there is generally a core of dead tissue (infarct) and a surrounding area of damaged tissue that may either recover or die (penumbra). The change in the size of the lesion over a varying period of time (several days, weeks, and/or months) will be correlated with qualitative assessments of patient functionality and the different forms of treatment.

6.3 Weighted Bipartite Matching for 3D Correspondence and Rigid 3D Motion

A closed-form solution has been developed for the problem of determining correspondences between two sets of 3D points in which the numbers of points in the sets are not the same (Cheng, Wu et al. 1996). This is the general 3D rigid motion problem, and the solution is based on a decomposition of the correlation matrix eigenstructure. Using a heuristic measure of point-pair affinity derived from the eigenstructure, a weighted bipartite matching algorithm has been developed to determine the correspondences in general cases where missing points occur. The use of the affinity heuristic also leads to a fast outlier removal algorithm, which can be run iteratively to refine the correspondence recovery.
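The missing-point behavior of affinity-based matching can be sketched as below. The cited method derives affinities from the eigenstructure of a correlation matrix and solves a weighted bipartite matching; this toy substitutes Gaussian distance affinities and a greedy matching, which is enough to show unequal-size sets with an unmatched outlier.

```python
import numpy as np

# Toy correspondence recovery between unequal-size 3D point sets:
# Gaussian distance affinities plus greedy matching (stand-ins for the
# eigenstructure affinity and weighted bipartite matching of the paper).

def affinities(P, Q, sigma=1.0):
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def greedy_match(A, min_affinity=0.5):
    pairs, used_i, used_j = [], set(), set()
    order = np.dstack(np.unravel_index(np.argsort(A, axis=None)[::-1],
                                       A.shape))[0]
    for i, j in order:                       # best affinities first
        if A[i, j] < min_affinity:
            break                            # rest are too weak to match
        if i not in used_i and j not in used_j:
            pairs.append((int(i), int(j)))
            used_i.add(i)
            used_j.add(j)
    return pairs

P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5.]])  # one outlier
Q = np.array([[0.05, 0, 0], [0, 1.02, 0], [1.01, 0, 0]])     # one point missing
pairs = sorted(greedy_match(affinities(P, Q)))
```

The outlier point in P receives negligible affinity to every point in Q and is correctly left unmatched.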


Publication date: 1994